library(dplyr)   # for %>% and select()

Ames <- read.table('https://raw.githubusercontent.com/ajkirkpatrick/FS20/postS21_rev/classdata/ames.csv',
                   header = TRUE,
                   sep = ',') %>%
  dplyr::select(-Id)

12: LASSO
This assignment is due on Monday, April 13th
All assignments are due on D2L by 11:59pm on the due date. Late work is not accepted. You do not need to submit your .rmd file - just the properly-knitted PDF. All assignments must be properly rendered to PDF using LaTeX. Make sure you start your assignment sufficiently early so that you have time to address rendering issues. Come to office hours or use the course Slack if you have issues. Using an RStudio instance on posit.cloud is always a feasible alternative. Remember, if you use any AI for coding, you must comment each line with your own interpretation of what that line of code does.
Oh no. Really? Ames again?
Yes, Ames again. Let’s predict some SalePrices!
Data cleaning
Repeat the data cleaning exercise from last week’s lab. The point is to make sure that every observation is non-NA and that every predictor variable has more than one value. Use skimr::skim on Ames to find predictors that have only one value or that are missing many values. Take them out, then use na.omit to ensure there are no NA values left. Check to make sure you still have at least 800 or so observations!
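A minimal sketch of this cleaning pipeline, wrapped in a helper function so it’s easy to reuse. The 20% missingness threshold is illustrative, not a required value; inspect the skimr::skim output and use your own judgment about which columns to drop.

```r
library(dplyr)

# Drop single-valued predictors, mostly-missing predictors, then NA rows.
# max_missing = 0.2 is an illustrative cutoff, not an assignment requirement.
clean_predictors <- function(df, max_missing = 0.2) {
  df %>%
    select(-where(~ n_distinct(.x, na.rm = TRUE) <= 1)) %>%  # only one value
    select(-where(~ mean(is.na(.x)) > max_missing)) %>%       # too many NAs
    na.omit()                                                 # no NA rows left
}

# Ames_clean <- clean_predictors(Ames)
# nrow(Ames_clean)   # check: should still be at least ~800 observations
```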
Predictive model
For the assignment below, we’ll use glmnet::cv.glmnet to estimate a LASSO model. Note that you’re asked to use up to 16 predictors and at least 5 interactions; you can go beyond this. Unlike our linear model building, complexity in LASSO is not controlled by writing out a bunch of formulas with more and more terms - it is controlled by the lambda penalty parameter. So we write one formula and let lambda vary.
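The “one formula” idea can be sketched as follows. The predictor and interaction names below are hypothetical placeholders drawn from Ames, not required choices - pick your own from your cleaned data.

```r
# Hypothetical LASSO formula: the predictors and interactions here are
# placeholders for illustration; choose your own 16 predictors and 5+ interactions.
f <- SalePrice ~ GrLivArea + OverallQual + YearBuilt + LotArea + Neighborhood +
  GrLivArea:OverallQual + YearBuilt:OverallQual

f  # printing the formula object shows the model specification
```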
1. Clean your data as described above. Choose up to 16 predictor variables and clean your data so that no `NA` values are left.
2. Choose at least 5 interactions between your predictor variables and print out the formula you’ll use to predict `SalePrice`.
3. In your code, use `set.seed(24224)` so that your results will always be the same. Why do we need to set a seed? When we (well, `glmnet::cv.glmnet`) make the Train and Test sample(s), they are selected randomly. If you don’t set a seed, you’ll get slightly different answers every time you run it!
4. Use `glmnet::cv.glmnet` to estimate a LASSO model (see lecture notes this week) that predicts `SalePrice` given the observed data and using your formula. Slide 33 shows cross-validation using both `alpha` and `lambda` - a LASSO model holds `alpha` fixed at `alpha = 1`. We’ll search using `lambda` as our tuning parameter. Call the resulting object `net_cv`.
   - To do this, you’ll have to make a matrix to give to `cv.glmnet` in the `x` argument because it doesn’t take a formula. You can use `model.matrix()` to create the matrix; use that matrix as your `x`. It will not add the `SalePrice` variable to the `x` matrix - you just have to give it `SalePrice` as the `y` argument.
5. The resulting object will be a `cv.glmnet` object. You can see the optimal lambda just by printing the object with `print(net_cv)` and looking at the min value of Lambda. Make sure the optimal (RMSE-minimizing) value of lambda is not the largest or smallest value of lambda you gave it. If it is, extend the range of lambdas until you get an interior solution. Following the instructions from our lecture notes’ TRY IT, extract the lambdas and their respective RMSE values into a data.frame and make a plot similar to the RMSE plot from lecture.
6. Answer the following question: what is the optimal `lambda` based on the plot/data? Do you see a minimum point in the plot?
7. Extracting the non-zero coefficients is a little tricky, but let’s do it. We’ll use the `coef` function to extract the coefficients. The `coef` function, when used on a `glmnet` object, takes the argument `s`, which is the `lambda` value at which you’d like to extract coefficients. Our `s` value should be the best value of lambda, which we can extract from `net_cv$lambda.min`. Put those together: `coef(net_cv, s = net_cv$lambda.min)`. The output may be kinda long; that’s OK.
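The whole workflow above can be sketched end-to-end. To keep the sketch self-contained it runs on a small synthetic data set standing in for your cleaned Ames data; swap in your own data, formula, and predictors. It assumes the glmnet package is installed.

```r
library(glmnet)
set.seed(24224)  # so the cross-validation folds are reproducible

# Synthetic stand-in for the cleaned Ames data (your code uses Ames_clean)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 2 * dat$x1 - dat$x2 + rnorm(n)   # stand-in for SalePrice

# model.matrix() turns the formula into a numeric x matrix (drop the intercept column)
X <- model.matrix(y ~ x1 + x2 + x3 + x1:x2, data = dat)[, -1]

# alpha = 1 fixes a LASSO; cv.glmnet cross-validates over a grid of lambdas
net_cv <- cv.glmnet(x = X, y = dat$y, alpha = 1)

# Lambdas and their cross-validated RMSE, ready for a plot like the lecture's
cv_df <- data.frame(lambda = net_cv$lambda, rmse = sqrt(net_cv$cvm))

net_cv$lambda.min                      # optimal (CV-error-minimizing) lambda
coef(net_cv, s = net_cv$lambda.min)    # coefficients at the best lambda
```

Check that `net_cv$lambda.min` falls strictly inside the range of `cv_df$lambda`; if it sits at an endpoint, widen the lambda grid before interpreting the coefficients.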